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ABSTRACT 


Information may be stored in the memory or the memory exten- 
sions of digital computers, The ease with which this information 
may be manipulated and retrieved is greatly complicated if it is 
of the non-numeric type. However, techniques are known which 
facilitate the solutions to these information storage and retrieval 
problems, These techniques allow the optimization of efficiency 
according to one or more criteria, Various list languages are 
introduced and the overall information storage and retrieval 
problem is outlined. Four basic storage allocation systems are 
discussed and their relative merits compared, Tree structures 
embodying the best features of the four basic systems are described 
and evaluated, Efficiency computations from the standpoint of 
search time and the search-time, storage-space product are pre- 
sented, A specific tree structure called a trie is introduced 
as a compromise between the two extremes of the balanced tree and 
the fully elided unbalanced tree. Finally, a technique called 
Fibonaccian searching is presented as a method to conserve both 


storage space and time. 
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by Introduction. 

It is the purpose of this paper to bring together in a single 
document a description of some of the schemes which have been pro- 
posed in the field of Information Storage and Retrieval. Further, 
it is intended that the emphasis be on the efficiencies of the 
various schemes and how these efficiencies are modified by inter- 
action during combination. 

Automated information storage and retrieval has for some time 
been a dream of scientists and engineers, A most provocative article 
was written by Vannevar Bush just after World War II describing the 
"library problem" and a proposed solution. [1] Most of what was 
proposed could then only be classified as speculation but we have 
seen that this speculation was well based. His proposed private file 
and library, which he named the "memex", is within the capabilities 
of our technology today. The bar to its realization is the com- 
plexity of the indexing and manipulation of non-numerical information, 
Much work has been done on the problem, but we are still a long way 
from a completely automated system, It is not suggested that what 
follows here is an exhaustive compilation of the state-of-the-art, 

It is rather a sampling to indicate the current status of some of 
our attempts at completely mechanized information storage and retrieval. 

It is intended that Section 2 will provide an introduction to 
a widely used device for manipulation of non-numeric information, 
i.e. the list. Also it briefly describes several computer languages 
which use lists, Section 3 describes the information storage and 


retrieval problems of 1) automatic determination of document 





content, 2) organization of structured information within a 
computer, and 3) evaluation of systems, Specific file processing 
operations are discussed in Section 4 in order to provide a common 
basis for evaluation of the storage allocations described in 
Sections 5 and 6. Sections 5 and 6 contain programs which describe 
the searches and are intended to convey a feeling for the type and 
amount of processing required, Section 7 shows how it is possible 
to design trees to minimize the time necessary to search them while 
at the same time holding the tree's storage requirements to a 
minimum. A comparison is made in Section 9 between the storage 
requirements of trees representing both extreme and mean types. 
Finally, an efficient method of searching an ordered list in a 


machine not capable of a binary shift is given in Section 9, 





2 Hueristic programming and the development of list languages. 

The application of digital computers has been most successful 
thus far in areas where the manipulation of numeric quantities is 
the primary operation. More recently, attempts are being made to 
handle more complex and ill structured problems such as theorem 
proving, playing chess and checkers, or performing various symbolic 
calculations like differentiation and integration. The motivations 
behind this work mania from a desire to extend the capabilities of 
computers to the desire to understand how humans think, learn, and 
solve problems. [24] 

In a non-numeric problem the contents of words in memory are not 
treated as though they contained numbers (although in fact they do). 
Rather, the contents are treated as alphanumeric strings of characters 
or symbols. For example, they might be algebraic expressions, coded 
information, or English words, For this reason the term symbol 
manipulation is sometimes used to characterize the procedures 
used, [35] 

A fairly common characteristic of a nonnumerical problem is 
that there is no algorithm or standard procedure for solving it. 
More precisely, the algorithms needed for solution are very complex 
--- if they are known to exist at all. Im addition, there is a 
need for a unit of data larger than a single number, and the data 
must be able to change in both structure and content as the problem 


is solved, 





A solution to this problem has been found in the list. A list 
may be defined as an ordered set of tokens, connected in a structure 
such that at least one succaeaiee to each token can be identified. 

[35] In some list structures, predecessors of tokens can be 
similarly identified, A simple example of a list is an ordered set 
of integers. The successor of one integer is the "next larger" 
integer. A more complex example is illustrated in Figure 1. [In 
this example (taken from [35] ) the address of each word in the 
list is stored in the word that precedes it in the list, Each word 
in the list may contain two addresses; one is the address of the 
item that belongs at that point, and the other is the address of the 
next word in the list. The latter address is the pointer address, 
The numbers outside the boxes are location addresses for the boxes. 
The list tokens indicated in this example comprise the symbolic 


expression: 


(NAME -LIST/WORD)+ABC 
Note that the order of the tokens is given by the pointer addresses; 
the tokens need not be listed in memory in sequence, The pointer 
address 00000 signals the end of the list. 

With a list structure, the insertion or deletion of tokens is 
relatively simple. For example, to insert the subexpression +DIGIT 
immediately after NAME in the expression in Figure 1, the pointers 
are modified as shown in Figure 2. Two words were added to the 
list; only one pointer already in the list was changed (from 02205 
to 02221), and everything else in the list was unchanged. Two 


tokens were added to the block at 04000. Deletion of tokens is also 
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simple; pointers are modified, and nothing else is done. There will 
be extraneous information in words that were once part of the list, 
but since pointers bypass these words, no reference is made to them. 

As lists are processed, many are replaced by others or are 
for some reason of no further use. If the space taken by such lists 
is never again used, the available space within the computer may 
eventually be depleted. The words no longer needed are scattered 
all over memory, and a means is required for determining where they 
are, This is accomplished by forming a list of available space (LAS). 
In this list each word points to a successor and to an available 
word. This list has the same structure as the list described. If 
the scheme is to function properly, each list operation must supply 
to the LAS those words made free by the operation, Thus at any time 
all free space is linked into one list, The programmer is free of 
worry over the space, 

Several languages have been developed during the past few 
years which use lists and list structures as a fundamental working 
unit. A brief review of this work illustrates some of the applica- 
tions of list processing languages and the various ways in which 
list processing can be introduced into programming. 

The work of Newell, Shaw, and Simon began in late 1954, trig- 
gered by the pioneering work of Selfridge and Dineen on a program 
for recognizing visual patterns, [4] [34] They first worked 
on chess and then switched to the task of proving theorems in the 
propositional calculus of Whitehead and Russell. [26] They 


originally devised some languages that were tied closely to subject 
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matter---a "chess language" and a "logic language". These languages, 
collectively called IPL-I (Information Processing Language-I), 
although designed as pseudo-codes, never reached the coded stage 

(see [25] for description). [24] [39] [27] The IPL series, 

I through VI, was developed for use with symbolic logic programs and 
hueristic programs such as chess programs, programs for use in 
management (e.g. assigning tasks to work stations for an assembly 
line) and a program called GPS, for General Problem Solver, which 

was an effort to simulate human behavior. 

H. Gelernter has developed a program for proving theorems in 
plane geometry. [9] fil) To do this he and his colleagues 
developed a list processing adjunct to FORTRAN for the 7/04 called 
FLPL for FORTRAN List Processing Language. f10] This was done 
by adding a series of subroutines to the FORTRAN system, FLPL uses 
lists only for data, since it uses FORTRAN produced machine code 
for routines. 

J. McCarthy has developed another list language for the 704 
called LISP, for List Processor. Ee, This one, like IPL-V, 
uses lists for both routines and data. Externally, LISP uses a 
horizontal notation in which list structures are represented with 
the aid of parentheses. The identity of parenthetical notation 


and list structures can be seen from the following figure: 


(A,B, (C,D)) 
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Part of the reason for the development of LISP is McCarthy's work 
with M. Minsky on heuristic programs, There has also been some 
work done on analytic differentiation and integration of elementary 
functions in LISP, 

Another list processing language which was designed to retain 
the advantages of the preceding languages while simplifying machine 
processing of lists is called a Threaded List. It was developed 
by A. J. Perlis and Charles Thornton. [29] A threaded list is 
a list structure in which the last element of each list specifies 
the location of the head of the list of which it is the terminal 
Mesber. The advantage of this structure is that it permits the de- 
finition of various modes for sequencing through lists without 
requiring the use of the usual push-down lists for retaining 
sequencing information. This corresponds, in the representation 
of programs by list structures, to the representation of all com- 
putable functions iteratively rather than recursively. 

Weizenbaum developed the KLS (Knotted List Structure) language | 
while considering the relationship between computer languages and 
computer organizations, and the hardware-software tradeoffs which 
this relationship makes possible, [39] Later he developed SLIP 
(Symetric List Processor) which is a descendent of FLPL, IFL-V, 
Threaded Lists, and his own KLS. In this system each list cell 
carries both a forward and a backward link as well as a datum 
(token of information). It is symetric in the sense that its lists 
do not have a preferred orientation. Any operation which can be 


carried out on the top of a list can just as easily be carried out 





on the bottom, It is a language system designed to be imbedded in 
a higher order language capable of calling machine language sub- 
routines (like FLPL). [40] 

The Philco Company has implemented the IPL-V language on their 
2000 computer as a set of macro-operations, subroutines and con- 
ventions supplementing TAC (Translator-Assembler-Compiler, the 
assambly language for the 2000). These macros, subroutines, and 
conventions will be referred to as TALL (TAC List Language), TALL 
uses the loading facilities of TAC, the IPL-V primitive processes, 
and a set of subroutines performing the work of the interpreter. 
The macros aid in the translation from IPL-V to TAC. The macros 
and the primitive processes can be placed on the TAC subroutine 
library tape and called in as required during assembly. 

The implementation of IPL-V in this fashion has severai ad- 
vantages; (1) the time required to get a basis IPL-V system running 
on the 2000 is only three man-weeks; (2) symbolic machine language 
instructions can easily be inserted into TALL programs; (3) IPL-V 
statements can be used in conjunction with FORTRAN statements or 
JOVIAL statements; and (4) no additional work is required to make 
TALL compatible with any monitor system for the 2000. pot 

Compilers for one or more list processing languages exist 
today for every major general purpose computer. 

In recognition of the fact that not all problems are purely 
of the symbol manipulation type, it might be well to mention a 
technique for symbol manipulation and numerical calculation. Ross 


has developed a type of generic structure which he calls a plex. [31] 


10 





There exist problems which are purely numerical, thereby not requir- 
ing symbol manipulation techniques, and there also exist problems 
which are purely symbolic, thereby not requiring efficient handling 
of data for computation, But there is a much larger class of 
problems in which both symbolic manipulation and numerical calcula- 
tion are required and are inextricably intertwined, The plex 
structure is designed to handle this latter type. A claim is made 
that this structure is more powerful than the list structure (it 
includes lists as a subcase) and appears to be better suited for 

the concise representation of the complex interrelations of elements 


which constitute a "problem". [31] Note: The word plex is an 


abbreviation of the word plexus: "An interwoven combination of 
parts of a structure; a network," ----Webster. 


ye 





in, 


Se, Information storage and retrieval, 

One important subset of the set of non-numeric applications of 
the digital computer is information storage and retrieval, The most 
recent and profitable improvements in the capabilities of digital 
computers have been the result of software innovations, i. e. 
assembly programs, subroutines, compilers or translators, etc. 
However, as the result of work in the symbol manipulation area, 


certain hardware specifications have been shown desireable, These 


% 


¢ 


may be classified into two main groups, 1) those which apply to the 
modern random access magnetic core type memory, and 2) those which 
apply to the new associative type memories, Typical of the refer- 
ences of the former are [41] and za) , and of the latter 

[30] and [28] . It is not the intent of this paper to become 
involved in detailed hardware comparisons, On the other hand, most 
of the discussion that follows is concerned with software solutions 
to problems related to storage and manipulation of structured in- 
formation within the magnetic core, random access, type computers 
(the most common in-use general purpose computers). A word of warning 
to the reader is appropriate at this point. The literature on this 
subject often uses the term "associative memory" without distinc- 
tion between the true associative memories and associative memories 

3 
implemented in present day addressable memories by simulation | 
techniques, e. g. see [17_] 
Most of what has been written regarding information storage 


and retrieval starts with the very convenient assumption that the 


information has previously been formatted, keyed, and stored. [32] 
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This does not mean, however, that the subject of automatic determina- 
tion of content of documents has been completely neglected. There 
have been Several attacks on the problem, 

A well known method involves a frequency count of words in the 
text; those words which occur most frequently are judged to be the 
most significant and are used as indicators of the content of the 
document, A promising variant to this method has been proposed which 
utilizes only the initial occurrences of nouns of the text to detect 
document content. The results indicate that the proposed method may 
be useful for both the automatic indexing of documents and the pro- 
duction of automatic abstracts. [13] 

A completely different approach to automatic indexing and 
classification is the use of bibliographic references provided with 
the document as quasi-index terms to improve indexing methods, An 
outstanding advantage of this method is that processing the full text 
of the document is not required, Although the formats of biblio- 
graphic references may vary in different documents, there has been 
some success at using a computer program to put these various forms 
into a standard form, so that they may then be processed by another 
program, 

Another class of techniques to aid in the automatic determina- 
tion of document content utilizes a syntactic analysis of some of 
the sentences of the document, In a retrieval system of this type 
there are numerous alternatives available in the degree and 


thoroughness of analysis. 
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The use of the semantic values of the words of documents has 
been used by several researchers in the area of data retrieval, 
There are difficulties in this approach as might be expected when 
different authors naturally use different words to express similar 
ideas, These semantic relations may also be supported by a syn- 
tactic analysis as mentioned above. 

Another area of information storage and retrieval which has 
been thoroughly investigated is the organization of structured in- 
formation in a computer and the effects of various allocations 
systems upon search strategies and matching procedures, Three basic 
formats --- vector, tree, and graph --- are encountered in the 
literature for the representation of structured information, Although 
the three forms are closely related, it is often convenient to 
consider them separately when analyzing characteristics of systems 
associated with them. 

During the past several years, techniques for the evaluation 
of information storage and retrieval systems have been developed. 
Swets has reviewed ten different measures for evaluation and 
proposes another based on statistical decision theory. [38] 

The various measures reviewed have much in common, Eight of 
them evaluate only the effectiveness (accuracy, sensitivity, dis- 
crimination) of a retrieval system and are derived completely, in 
one way or another, from the two-by-two contingency table of 
pertinence and retrieval represented in Fig. 3. The other three 
measures assess efficiency as well as effectiveness by including 


such performance factors as time, convenience, operating cost, 
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and product form. The questions of effectiveness and efficiency 


are difficult to answer and care must be taken in any comparative 


analysis, [38] [20] [21] 








Figure 3. 
The Two-by-Two contingency table of pertinence and retrieval, 
P and F denote, respectively, pertinent and nonpertinent items; 
R and R denote, respectively, retrieved and unretrieved items; 
a, b, c, and d represent the simple or weighted frequencies of 
occurrence of the four conjunctions; V, is the value of retrieving 


a pertinent item; V, is the vaiue of not retrieving a nonpertinent 


2 


is the cost of retrieving a nonpertinent item; anc K, is 


item; K D 


t 
the cost of failing to retrieve a pertinent item. [387] 
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‘a File processing operations, 

One of the major considerations when deciding wpon a particular 
Storage allocation scheme for use in an information processing 
system is its effectiveness and/or efficiency, In order to make 
meaningful comparisons between various schemes considered, it is 
necessary to use parameters and file processing operations which 
are common to them all, insofar as this is possible. Iverson has 
developed a programming language and a notation for the description 
of programs. [16] Part of his notation is used throughout the 
remainder of this paper, See Table 1 for a summary of parameters 
and program symbols, 

Consider a system where the major subdivision of information 
is a file, Let each file be further subdivided into records, and 
the records into items, In order to realize the ability to mani- 
pulate the files as desired, some means must be established to 
recognize units of information down through the smallest, i. e. the 
item. Each item, therefore, must have associated with it a tag or 
key, or descriptor which is unique. In fact, it is useful to define 
the item as having two parts, 1) the key which is that part which 
distinguishes the item from all other items, and 2) the function 
which is that part which is not the key. 37a 

Each key corresponding to an item of the file is assumed to 


be an h-tuple, viz., 


s€ S, XS, X... XS, where s = an argument key 
aT ith set 
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and each of the sets, S. ,» is considered to have m mebers or 
h 
characters, Thus there are m distinct keys possible, e. g. if we 


restrict the keys to any six alphanumerics (a 6-tuple), then: 


h = 6 
m= 36 = 26 letters + 10 digits 
iis 36° ~ omxyuor 


The file is assumed to contain n items, all distinct, All 
items have the same probability of being selected, 


h . 
If a parameter d is chosen so that d = n, then d/m is a measure 


? 
of the frequency of occurrence of each character, 

The physical memory in which the file is stored is assumed to 
contain 2° words, so that "a" bits are required to address any word 
uniquely, The representation of each character is assumed to take 
b bits, where m = 2”, Thus hb bits are required to tepresent an 
h-tuple s. Therefore, in our example, b = 6 and hb = 36, 

Two types of criteria are used to compare the storage allioca- 
tion methods: the storage capacity required and the time required 
to perform a particular file-processing operation, 

Three measures of storage capacity are used; 1) the number, 

w , of bits per memory word required to represent both an item and 
any allocation information associated with that item (w = number of 
bits per memory word if the word length is variable, and equals 
some integral multiple of a standard word if the word iength is 
fixed), 2) the total number of words necessary to represent the 


file in a particular allocation, and 3) the total mumber of bits 


necessary, 
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It is convenient to normalize the latter two measures: [36] 


r(words) = 


total number of words to represent the file and associated allocation data 


number of words to represent the file 


r(bits) = 


total number of bits to represent the file and associated allocation data 


number of bits to represent the file 
By the assumptions of item and file size the denominators of r(words) 
and r({bits) are n and nb(h + e), respectively. However, computations 
are simplified if nbh is used as the number of bits to represent the 
file. 

In computing the storage capacities required, the locations 
needed to store the programs and intermediate operating data are 
ignored, This is justifiable for the programs (q.v.) for similar 
searches in different allocations all have approximately the same 
number of lines and use about the same number of parameters, Moreover, 
the fraction of total storage used for the programs is much less than 
that used for the file in applications where these schemes are 
necessary, 

Typical file-processing operations with respect to which time 
comparisons are made are: | 36 ] 


search for match 

search for multiple match 

search for smallest item 

search for item next smaller than a given item 
add an item to the file 


delete an item from the file 
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Iteration of either "smaller" search can sort the file. Iteration 
of the add-item operation may be construed as a recursive procedure 
for constructing the allocation, Thus if the time te construct the 


allocation is required, one computes 


where t. is the add-item time for a file of i items. 
Iwo items, a and b , are said to match (a = b) if and only if, 
noe all i, a, is the same character as b. : 


Two items, 


| 


and b, are said to match with respect to uw 

( wrt u), where u is a logical vector of dimension h, if and only if 
u/a = u/b ; that is, a, = b, whenu, = 1 (here we are using Iverson's 
notation. In particular, u/a is his "compression" operation). [16 | 

u is called a mask. If the mask u is not explicitly mentioned, it 

is assumed to be a vector or all 1's. 

An order relation is defined on S as follows: For those S. 
that are ordered, let ' < '"' be the relation "precedes", Roughly 
speaking, a < b , if, when the unordered sets are ignored and 
the ordered sets are considered to be nonnegative integers, the 
"number" a is less than the "number" b. 

The restriction of the ordering with respect to a mask u is 
defined: a<b wrt u if and only if_u/a < u/b. 

Those items s of file F which are such that, for all t in 


F,s<t wrt u, are the smallest items of the file wrt u. 


ly, 





The items next smaller than .9 wrt % are those items © in’ F 
such that t < s wrt u and for which there is no r in F such that 
t<xr<swrt uy. 

The definition of the ordering relation " < " requires that 
the elements of the h-tuple are weighted by the order in which they 
occur in the hetuple. This may be generalized to wtilize different 
orderings as specified by some permutation vector p. The restricted 
ordering with respect to mask u may still be used, so that 
a<¢b wrt u wrt p if and only if u/a ¢ u/b wrt u/p (The 
expression x € y wrt z is not ambiguous because a given z cannot 
be both a permutation vector and a mask vector except in the trivial 
case z = (1)). If the permutation p is not explicitly mentioned, 
it is assumed to be the identity permutation, 

A match search of file F for item x wrt u is an operation 
which identifies that item s oe F such that x = s wrt u. fo 
qualify as a match search u must be so chosen that the search identi- 
fies exactly no or one item of the file. If u is such that several 
items of the file may be identified, the search is called a multiple- 
match search, In general a multiple-match search is more complex 
than a match search so that it is assumed that the user of the file 
knows when to expect a unique result and when he will be content 
with a single response from many possibilities. 

As an example illustrating the possible uses of the suggested 
searches, consider the file to be a personnel record of a large 
corporation, [ 36 | In such a file each item would correspond to 


one employee and might be categorized as: 
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Name 
Social security number 
Department number 
Job code 
Salary 
Hence, each item is a 5-tuple. Note that in this example the sets 


of the file are: 


Rg 


the set of employees’ names 
S, # the set of employees' social security numbers 
S. «# the set of departments 
S, @ the set of job codes 
So fe the Set Gl Salaries 
We have obviously departed from our original example where the sets 
S5 were the set of alphanumeric characters, This is done to make 
the explanation of searches more easily seen. 
Probably the most. common search of such a file would be a 
match search to locate the item for a particular individual: e.g. 
u = (1,0,0,0,0) 
s = (Smith John A,-,-,-,-) 
To locate all the secretaries in department 4/74, a multiple- 
match search would be conducted with 
u = (0,0,1,1,0) 
s = (-,-,474,secry, -) 
The payroll department might desire an alphabetical listing 


of all employees by departments and salary; that is, sort the file 


Zi 





with major category "salary", then "department number", then alpha- 


betically by ''name'', To do this, a next larger search is iterated 


with 

t= (1 051051) 

pe= (3,-,2,-,1) 

s = last item retrieved 
and 


initial s = (A,-,9,-,0). 
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> Storage allocation systems and their relative merits. 

When reduced to basic sonoma ee all information storage 
systems for addressable storage within a digital computer memory 
use one or more applications of four basic types, i. e. random, 
ordered, computable, or chained, Each has its own advantages and 
disadvantages, As we shall see later, more complex systems which 
are combinations of these basic systems may be devised which allow 
optimization for particular applications. 

In this part the four simple storage allocation systems mentioned 
above are examined, For each, the description of the allocation, 
discussion of various searches, programs for the searches, and cal- 
culations of search times and storage requirements are given. The 
treatment here has been adapted from Sussenguth with minor modifica- 
tions. [36] 

The programs delineating some of the search descriptions are 
intended to convey a feeling for the type and amount of processing 
required, rather than being exact programs which cover all contin- 
gencies. Indeed, the details of initialization and special cases 
(e.g., multiple responses to a smaller search) are frequently 
omitted or condensed to a single program step. In contrast to this 


suppression of detail, in the main iterative loop of the program, 


the step of retrieving an item from memory 


a<—M 


is explicitly shown. It is this operation that is assumed to be 


the most time-consuming, and it is the number of times that this 


pas 





step is executed that is reflected in the search times which are 
calculated, If the operaticns of initializing, indexing, comparing, 
etc. are not carried out in parallel with the retrieval operation, 
or their execution times are not negligible with respect to it, 

the program will give an idea of how much the memory-access time 
must be increased to reflect a proper item time. 

When a program terminates at a normal exit (error exits are 
marked with an asterisk), the item satisfying the search is contained 
in the accumulator a, For those searches which may be satisfied 
by several items, the line "process a" is used to indicate that 
the present contents of the accumulator satisfy the search; after 
this line has been executed, the search continues until it terminates 
at an exit. 

The programs are all written in the Iverson language. Table f, 
p. vii gives the legend which is used for ail programs. 

The measure of time required to perform an operation is the 
number of items that must be “handled" to achieve the desired 


objective. "Handling an item," 


an intentionally vague phrase, may 
be construed as extracting an item from memory and transferring 

it to the processing unit in which a computation or decision is 
performed based on that item. In general, many items will be 
handled to preduce the file operation. By defining an item time 
in this way, any explicit real time measure (i.e€., number of 
seconds) is eliminated but, in so doing, one must avoid the 


impression that item times for different operations and different 


allocation schemes are the same with respect to real time. This 
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is, of course, not true in general. To give some feeling for the 
real time measure of an item time, simple skeleton programs for the 
basic file operations will be given for most storage allocations. 

The program steps may then be weighted in terms of operating times 

of a particular computer. The search times which are computed, then, 
are just a count of the number of file items which must be retrieved 
from memory to perform the desired task. 

If the program and intermediate operating data are stored in 
the same memory as the file, the real time value of an item should 
include the number of retrievals of both instructions and data re- 
quired to perform one iteration of the program, If, however, the 
program and intermediate parameters are stored in an auxiliary high- 
speed (relative to the file storage) memory, it may be safe to ignore 
their effect. In this latter case search times for different alloca- 
tions and different operations may be directly compared, 

The search times and storage ratios computed for each alloca- 
tion are summarized in Table 2,pp 4/7,48.There are three entries for 
each search: the maximum, minimum, and expected number oi item 
times to complete the search, Frequently the entries are approxi- 
mations of the expressions which have been derived. When applica- 
ble, the u or p for which the entries are calculated are shown; 
for most other u and p the search reduces to the corresponding 
search of the random allocation. Table 3 repeats the average values 


of Table 2 but with typical values chosen for the file parameters. 
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A. Random 

The simplest storage allocation system is that in which the items 
are stored in a random fashion in consecutive memory locations, 
Thus, there is no structure to this organization and it probably 
does not deserve being classified as an allocation system, It is 
important to consider it, however, because it serves as a basis of 
comparison for other systems, the cost of a particular allocation 
system being considerably less than the costs involved here if the 
Peer syetem is at all efficient. Moreover, many allocations are 
designed to achieve maximum efficiency for one or two particular 
operations and may be considered to be random structures with respect 
to other operations. An important example of this is the smailer 
searches performed with respect to a permutation other than the 
identity permutation. 

Searching for a match in the random structure consists in exam- 
ining the items one by one until the desired item is found (program 
Al); hence, the maximum search time occurs when the matching item 
is the last one of the file and thus requires n item times, the 
minimum time is 1 when the first item matches, and the expected 
time is n/2. 

For a multiple-match, x wrt u, all items must be tested 
(program A2) as there is no a priori way of determining how many 
items satisfy the criterion. 

Similarly, it is necessary to test all items to determine the 
smallest item or the item next smaller than x wrt u wrt p 


(program A3 and AG). 
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Adding an item to the file is particularly simple, for it is 
inserted at the end of the file. Deleting an item is only slightly 
more complicated: the item is located and replaced by the last item 
on the list (replacement is necessary to preserve the solid nature 
of the storage.) Thus, deletion item times are just the match search 
times plus one, 

As no additional structural information is required for this 
allocation scheme, r = l. 

B. Ordered 

In an ordered file the items have been sorted by some external 
means into an increasing order with respect to some u and are stored 
in consecutive locations of the memory. 

A dictionary provides a simple example of such a file. To 
locate a specified word the thumb index is used to find a location 
near the desired entry, then the catchwords at the top of the pages 
are used to narrow the number of candidate items to a few dozen, and 
finally an entry-by-entry search is made to find the required entry. 
Thus, in general, the search for a match in an ordered file may be 
conducted by successively finer scans of the file, the first scan 
determining a "general area,'' the succeeding scans reducing the 
number of candidates until an exhaustive search is performed, 
Program Bl shows such a search in which the elements of the vector 
¢ determine the coarseness of the scans. 

Obviously, the minimum search time is 1, wherein the first 


item tested matches, The maximum number of items which may be 
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Program Bl : 
Dictionary match search for i= 


. 


© 








tested in the first scan is | a/c, | = 81> the maximum on the 
second scan | 21/e9 = 85> etc., so that the overall maximum 


search time is 


Sis | 


a 


aaa, ei then the maximum time is hd. 


If n = ae and c = (d 
An estimate off ANE expected search time is found by assuming only 
one half the maximum number of items are tested on each scan, 
Another search technique for the ordered file, closely related 
to the "dictionary" search, is the binary search in which, at each 
access to the file, the number of possible candidate items is 
successively halved until either the retrieved item matches or only 
one candidate remains (program B2). Again the minimum search time 
is 1, The maximum time occurs when the last item tested matches; 
this time is frog, (m+1)] . To determine the expected search time, 
t, notice that on the initial entry to the file only one item may 
be tested, on the second entry either of two entries is tested, one 
1. 


i- ; , 
of four on the third, and one of 2 items is tested on the ith 


entry. Thus, 


t = > (probability of selection of ith item) X (number of 
entries to reach the ith item) 


awe -/ : + 
ie er ee eee 


AF | 
= goa (F- 9-1) 
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where g = [ 10g,(n + 1) | a If n = 2® and 2° >> g, the expected 
search time becomes g - 1. Note that this g is not the same factor 
as used later in the discussion of trees. 

The search for multiple matches of u/x is not difficult if 
the mask vector is a prefix of the vector on which the items are 
ordered: A match search is conducted to find any item satisfying 
u/x; successive neighboring items "below" this item are retrieved 
until an item fails to satisfy the u/x match; similarly, items "above'' 
the initial item are retrieved (program B3). The various search 
times are thus the appropriate match search time plus the number of 
items satisfying the match plus two (two items retrieved do not 
satisfy u/x). If u is not a prefix vector, the search is essen- 
tially that of the random structure although modifications may be 
made to account for leading 1's in u. 

Similarly, the search for the smallest item and the item next 
smaller than a given item become trivial if the search is with 
respect to the u on which the items are ordered, If this is not 


the case, the file must be considered just a random listing. 


i 
The closed forms of all nontrivial summations are given in 
the appendix. 
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To add an item to an ordered file, a match search is performed 
to determine its proper position and then other items are moved 
to allow for its insertion. If the location of the top of the file 
is fixed, i. e., not subject to modification, all of the items 
greater than the new item must be moved down to allow for its inser- 
tion. Thus, at most n items must be moved (new item is the smallest 
item), at least no items must be moved (new item is the greatest), 
PP an average of n/2 items must be moved. If neither the top nor 
the bottom of the file is fixed, the maximum number of items which 
must be moved is n/2 and the average number of n/4, Hence, to 
obtain approximate times for the addition of an item, the respective 
match search times are added to either n, n/2, n/4&, or 0, as appro- 
priate. Deletion times are determined in an analogous manner, 

As only the file items themselves are stored, w = hb and r = l. 

C. Computable 

A storage allocation of a file in which the location of an item 
is a function of a subset of its elements, i. e., 

Location (s) = f(u/s) 

is called a computable allocation, Frequently, f is a linear function 
although it generally need not be, 

A common example of a computable allocation is the storage of 
a matrix A in which the location of the coefficient ai is a 


linear function of i and j: 


location(A, ,) = KL tA4tT 


BS 





The common practice is, of course, to store only the coefficient 
Avy not the 3-tuple (1,5,4;,); at the designated location, 

This practice of not physically storing in memory those elements 
of the item used as the arguments of f is advantageous in that a 
shorter word length may be used. This gain may be compensated by 
a loss in efficiency in some search, however. For example, if 
matrix A is searched to find its smallest coefficient, it may be 
difficult to determine the associated indices i and j, if they are 
not explicitly stored. If such searches are infrequent, this dis- 
advantage is insignificant, and then it is necessary to stcre only 
the t/s portion of each items. Thus, the peculiarity of r(bits) 


being less than unity arises: indeed, if u/s has z elements, 


Peps a= aP2 ri 

The use of the word "search" in a computable allocation seems 
out of place, for an item is located by simply computing f(u/s). 
Nevertheless, the time to compute f and retrieve the associated 
item shall be considered to be the match-search item time. This 
time has been designated f rather than 1 in Table 2 to distinguish 
computation time from retrieval time; the relative magnitude of f 
with respect to 1 is not restricted, (In the matrix example, it 
may be necessary to retrieve®,A, and Y from the same memory as the 
items themselves, wherein f would be about 3; or & ®, and ¥ 
might be stored in a high-speed memory especially reserved for 


such constants, wherein f might be much less than 1.) 


34 





A multiple-match with respect to v involves computing f for 
all values v/u/S. (In the matrix example, retrieving all the items 
of a particular row is a multiple-match search.) The multiple- 
match time is a computation times, where s is the product of the 
orders of the sets V/u/S. 

Searches for the smallest item and the item next smaller than 
a given item also involve one computation. 

Notice that the preceding arguments have been based on the 
premise that the search arguments are included in the domain of f, 
viz. in u/S, and that the elements of the domain are all known. If 
this is not true, the computable allocations reduces to a random 
allocation. Moreover, the function f may be given arguments outside 
its domain and an ill-defined request may result in a legitimate item. 
In other words, the computation of the location should be a two-step 
process; It must first be ascertained if the argument u/s is in 
the domain of f£; then, if it is, f(u/s) is computed, For example, 


if A is an 8-by-8 matrix, and the value of is requested, one 


ns 
might obtain the value of =i without any error indication, 

The addition or deletion of an item to a computable allocation 
is very difficult, because a completely new f£ must be constructed in 
most cases (possibly involving a reallocation of many items also). 
This may be circumvented when deleting an item by retaining f but 


inserting a dummy item in place of the one being deleted. If the 


dummy item is retrieved in some search thereafter, it is ignored. 


oe 





D. Chained 

In a chained allocation with each item of the file is associated 
the addresses of other items of the file. Thus when one item is 
retrieved, information is supplied as to how to locate other items 
of the file, Chaining items together in this way eliminates the 
need for storing the file in consecutive memory locations, greatly 
simplifies the problems of adding or deleting items, and results in 
reasonable search times, 

The chained allocation has search times comparable to the random 
allocation if the items chained together do not reflect any structure 
inherent in the file. If, however, items with common characters are 
chained together, the search times can be substantially reduced. 

This is accomplished by chaining together items with similar characters 
in a given character position, Thus if the chaining is within set 

Sis i, e. the ith character position, and the set has m distinct 
elements, there will be m chains each of (average) length n/m, An 
example of a chained allocation is shown in Fig. 4. 

When searching for a match, then, that chain which agrees with 
the appropriate character of the argument is followed and each item 
retrieved is tested for the match condition by comparing it with 
the argument item (program Dl}. The maximum, minimum, and expected 
search times are n/m, 1, and n/2m respectively. 

A similar procedure may be followed for multiple-match searches, 
but the entire chain must be traversed to insure all matching items 


are located (program D2). 
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For these searches it is necessary that the element of u 
corresponding to the set on which the chaining is constructed is 
iy (ive. u, = 1 where the chaining is within S.). To insure this, 
it may be necessary to have sets of chains on several different 
character positions so that there is at least one chain which can 
be followed for all u which may be of interest, 

To search for the smallest item, the entire chain correspond- 
ing to the least character of the most significant character position 
is traced (i. e. follow the chain corresponding to the least char- 
acter of set S. where i is such that u, = l and for all 
j# ip< p.). (Program D3) 

The search for the item next smaller than a given item s is 
similar, but the chain corresponding to the same character of the 
most significant character position is followed (i.e. follow the 
chain corresponding to s. where i is defined as above). If it should 
happen that s is the least item in this chain, the chain corresponding 
to the character next less than Ss; is followed and the search is 
for the largest element of that chain (program D4). 


Thus, the search time for the smallest item is always n/m, 


’ 
and the time for the next smaller item depends on whether or not 

it is necessary to "back up" one character, As at most one back-up 
is required (because it is the character in the most significant 
position which is altered) and the probability that it is required 


is m/n (because there are n/m items in each chain), the maximum, 


an. VY ( m\ 
minimum, and expected search times are eae A and (/+ ~) = 


respectively. 
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To find the first item in each chain, an auxiliary file of 
starting addresses is required, This is a listing ot each character 
of each position which is chained and the address of the first item 
of the corresponding chain. Such files may be random, ordered, or 
computable allocations as may be appropriate, The time for searching 
the auxiliary file, which has m items, has been ignored in the search 
times of Table 2. 

Notice that the chained structure is similar to the computable 
structure in that some characters of each item need not be stored 
since their value is implicit in the chaining. Thus, if v is the 
logical vector with 1's corresponding to the sets which are chained, 
only the elements v/s need be physically stored, No loss of informa- 
tion is incurred in doing this if the last item in each chain is con- 
nected to the auxiliary file; thus the appropriate character value 
may be ascertained whenever an item is retrieved via another chain. 
Although eliding more than one character position may lengthen the 
search times considerably, r(bits) may be greatly reduced, especially 
if a << b. It is assumed no elision has occurred in computing item 
times, 

To add an item to a chained file, the item is written into any 
available location and the chaining addresses modified, A simple 
way to modify the chain addresses is to consider the new item to be 
the first in each chain so that the starting list addresses point 
to the new item and the addresses previously in the starting list 


are those of the new item. Hence, if there are c character positions 


4) 





which have chains, ctl items must be handled to add a new one 
(c starting items plus the new item). 

To delete an item it is necessary to lecate it by a match 
search and then replace the chaining address of its predecessor by 
the chaining address of the item being deleted. The item must be 
located in each of the c chains; following just one chain is not 
sufficient for once an item is located, its successors in each 
chain are known, but not its predecessors, Hence, c match-search 
times are required to delete an item. 

In a chained allocation each word must contain w = hb + ca 
bits, hb for the item and ca for c chain addresses, Since charac- 
ters which are chained together may be elided, it is possible to 
reduce the bit count to (h - c)b + ca. In addition to the n words 
of the file, cm words are required to hold the starting addresses, 
so that 


r(words) = 1 + = ; 


Similarly it is found that 


ey Se gente oe I 


hb n 
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i, Evaluation of Results 

Before attempting to compare the relative efficiencies of two 
allocation systems, one must be certain that like quantities are being 
compared, One of the bases for comparison that has been selected is 
the item time, which essentially reflects the number of file items 
that must be processed to achieve the desired result. It does not, 
however, reflect a real time measure and, hence, the equality of two 
search times does not imply equality of actual processing time, 
Therefore, before comparing allocations on the basis of the data of 
Table 2, some consideration of the real time values of the item times 
must be made, 

The second basis of comparison chosen was the storage capacity 
which the file, plus the additional allocation structure, required. 
Three measures of storage capacity were computed: storage ratios 
r(words) and r(bits) and the number of bits per word, w. While it is 


mathematically true that 
r(bits) = r(words) x w/bh 


this equation may not be true in a practical sense depending on the 
characteristics of the storage medium and the data processor. Suppose 
the memory and processor have a fixed word length of N bits, [If 

N/2 < wc Nn 
for a particular file and application, then r(bits) has no real 
significance because one full word must be used to hold the necessary 
allocation data, wasting N - w bits of each word. Bits are similarly 


wasted if N < w and w is not an integral number of words, If, 
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however, w = N/2, and the computer has the ability to process half- 
words efficiently, two allocation items may be packed into one word. 
Conversely, if the computer has a variable field-length feature 

(and many computers <ibes r(words) loses significance because each 
allocation item uses only as many bits as necessary. 

Thus, in comparing allocations with respect to required storage 
capacity, the characteristics of a particular computer should be 
considered to determine whether r(bits) or r(words) reflects the 
storage efficiency more properly. 

It is by no means clear that operating time and storage capa- 
city constitute reasonable measures, Indeed it is easy to improve 
one at the expense of the other, For example, a random allocation 
requires essentially no maintenance, has both r(words) and r(bits) 
as small as possible, but has very long search times; at the other 
extreme, duplicating the file in all conceivably useful configura- 
tions is extremely wasteful of storage space but permits rapid search- 
ing. One function which overcomes this type of difficulty and has a 
reasonable physical interpretation is the product of search time 
and storage capacity. The product may be thought of as the cost of 
operating the system, for the storage capacity is a measure of the 
amount of equipment required and the search time is a measure of 


the amount of time the equipment is in use. 


log. IBM 7030, 7080, 705, 1401, 1620; RCA 501, 301;Univac 1107, 
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The nature of the user's problem is also an important and 
obvious factor in the determination of relative efficiency. If, 
for example, additions to or deletions from the file are rare, it 
is meaningless to use these times as a basis of comparison, If, 
however, the number of additions to the file is large, one might 
consider a composite allocation of a main ordered file and a sub- 
sidiary random file, which is searched if the item is not located 
in the main file; by merging the new items all at once into the main 
file when, say, the random file becomes large, an efficient system 
may be developed. 

With these words of caution in mind, let us now summarize the 
results of the previous parts. Tables 2 and 3 will be of assistance. 

The random allocation has no desirable qualities other than 
the case of adding items and, hence, finds little application 
except as a temporary auxiliary file, as noted above, 

The ordered allocation, which is probably the most widely used 
allocation system, is quite efficient with respect to both storage 
and search time, for files with infrequent changes. If the rate of 
change is not low, the file may be padded with dummy items, so that 
the deletion of an item is accomplished by merely designating it as 
a dummy and the addition of an item will involve moving only a few 
items, If a dictionary-type search is used instead of the binary 
search, slight rearrangements of the proper order within the fine 
scan range may be tolerable, thereby permitting even few item 


movements when adding an item, 
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A computable allocation is also quite efficient in both storage 
and search time, but the file items must have very special properties 
to allow the construction of a reasonable function, Therefore, 
when a computable allocation is possible, it probably is not recog- 
nized as such, but rather, the file is treated as a mathematical 
entity (e.g. a matrix or vector). A computable system is frequently 
used within another system, e.g. as the coarsest can in a dictionary 
search. Often the function evaluation only involves the use of index 
registers and indirect addressing. 

Chained structures appear rather wasteful with respect to both 
search time and number of bits required. However, the main use of 
such structures has been, not for searching for matching items, but 
for linking complex data or program chains which are frequentiy 
changed, Chained structures have found wide use in symbol-manipula- 


ting, game-playing, and theorem-proving programs. 
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be Representation by abstract trees. 

The abstraction of a storage scheme which is a combination of 
two or more of the basic schemes discussed above is called a Tree. 
Organization of information, or keys to items of information, in the 
form of a tree allows utilization of certain advantages of the com- 
bined schemes with a minimum trade-off in terms of combination 
generated disadvantages, The method of combination and the variance 
of tree parameters allow the programmer an opportunity to maximize 
the desireable features of a given combination, This further allows 
scientific selection and provides a means to predict performance, 

It is convenient to collect definitions used in the following 
discussion of tree structures. The terminology varies slightly with 
various authors but we will use the terms adopted by Iverson and 
Sussenguth, [16] sw Several terms are illustrated in Figure 5. 

A graph comprises a set of nodes and a set of associations 
specified between pairs of nodes. If node i is associated with node 
j, the association is called a branch from the initial node i to the 
terminal node j. A path is a sequence of branches such that the 
terminal node of each branch coincides with the initial node of the 
succeeding branch. Node j is reachable from node i if there 1s a 


path from node i to node j. The number of branches in a path is the 


art may bother some readers that the pictorial representation of 
trees do not always branch upward as do natural trees. Such is not 
the case, however, and the literature abounds with trees "growing" 
every which way. Branching downward seems the most comfortable 
possibly as the result of the general familiarity with the organiza- 
tion chart heirarchical trees (with the boss on top). 
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length of the path. A circuit is a path in which the initial node 
coincides with the terminal node. 

A tree is a graph which contains no circuits and has at most 
one branch entering each node. A root of a tree is a node which has 
no branches entering it, and a leaf is a node which has no branches 
leaving it. A root is said to lie on the first level of the tree, 
and a node which lies at the end of a path of length j-l from a root 
is on the jth level. A tree which contains n roots is said to be 
n-tuply rooted, and if n = 1 it is called a rooted tree, The set of 
nodes which lie at the end of a path of length one frem node »x 
comprises the filial or sib set of node x, and x is the parent node 
of that set. The nodes within a sib set are sometimes referred to 
as sisters. The set of nodes reachable from nede ~ is said to be 
governed by x and comprises the nodes of the subtree rooted at x. A 
chain is a tree which has at most one branch leaving each node. The 
number of branches leaving a node is called its branching ratic or 
degree. In particular, the degree of each leaf is zero. 

A tree is said to be uniform if all nodes on level j are utilized 
(or "filled") before the tree is expanded to include nodes on level 
j + 1. Each level of the tree corresponds to one of the character 
sets Sy of a file, Such a tree is illustrated in Figure 6, Only the 
leaves will be of particular interest since they correspond to the 
items of a file. The remaining nodes are dummy nodes inserted to 


fill out the tree for search purposes. 
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That set of nodes on level j which are the leaves of the subtree 
subtended by a node gs on level j-k is called the kth sib set of node gs 
and s is called the kth parent of the sib set. In general, the 
interest will be in first sib sets, which are those nodes "directly 
reachable" from a given node (the parent node), and, for brevity, 
first sib sets will be called just sib sets when no ambiguity will 
result. The members of a first sib set are called sibs. In a uni- 
form d-way tree, (first) sib sets have exactly d members. The defini- 
tion of sib set is extended so that the set of roots constitutes a 
sib set but the alternate term filial set is probably better in this 
case, In any listing of the members of a sib set, some node must be 
mentioned Peco That node will be called the heir node of the 
sib set. Sib sets are important because tree allocations are fre- 
quently characterized by the manner in which the sib sets are allocated. 

In order to show the necessary memory space for storage, consider 
a tree of height h (the number of character sets) and n leaves (the 
number of items in the file). The common degree d (the average 
number of characters in each set) is such that ae =n. In such a 


tree, the kth level has a nodes. Hence, the total number of nodes 


in a tree of height h is (See Appendix) ; 
A : 
gre gaol 
A-! 
=| 


i . , 7 
No additional ordering relation on the character sets is 
assumed or implied. 
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If one word is used to represent one node and if n 1s assumed larg:, 


r(words) = d 
d-1 

for the tree allocation system. However, there is no general one-to- 
one correspondence between nodes and words in computers with fixed 
word length. 

The nodes of a tree may be grouped in many ways. One of the 
simpler ways is to chain each node to its heir (if it has one). 
The sib sets of which the heirs are members may then be stored in 
consecutive memory locations (blocks) in a random, ordered, or com- 
putable manner (Fig. 7). In addition, the sib sets themselves may 
also be chained together, so that the tree has two types of chains 
associated with it: one joining heirs and one joining sib sets 
(Fig. 8). A third method is to store heirs in memory blocks and to 
chain the sib sets (Fig. 9). One other useful chain for file- 
processing is a chain with similar characters of the same character 
set (tree level). See Figure 10. The three types of tree chains 
are called heir chains,sib chains and character chains respectively. 

The character chains are superfluous for tree structures and, 
hence, need be included only to speed up various seavoneee: However , 
the concepts of heir and sib sets are essential to the tree structure, 
If heirs are grouped into consecutive memory locations and the sib 


sets chained, the resulting structure strongly resembles Iverson's 


i: | 
See page 75 for search: with the mask vector 4 not a full 
vector or a prefix vector. 
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meoht List. jie] Indeed, it is exactly a right list if the heir 
blocks are stored in the propet order, 1.e€., if the chagjm is determined 
by the right list degree vector, Similarly, if sib sets are grouped 
into blocks and the heirs chained, the structure resembles a left 

rise. 

Each leaf of the tree corresponds to an item cf the file. The 
intermediate nodes are dummies inserted to fill out the tree and form 
a path which is traced from root to leaf via the heir and sib links. 
The nodes on level i are identified with set Ss and within each sib 
set on level i there are enough nodes to correspond to the characters 
of set S. - There are at most min each sib set, and there are d on 
the average. The memory word or words correspending to a given node 
contains the representation of the character associated with the node 
(plus any chaining addresses and data that may be needed). The label 
or name of a node is the concatenation of the characters of the path 
leading to that node from a oe. Thus the label of a leaf is just 
an item (In Fig. 5 the character associated with a node is underlined 


and the remaining part of its label is not). 


ait should be noted at this point that the labels of rodes are 
not always simple characters or concatenations of characters 
(alphabetic or numeric). In some applications of tree structures, 
the labels are entire words or strings of words. The problem of word 
length is more acute in these cases but the principles of manipula- 
tion are the same. Therefore the tree discussion will, for clarity, 
be in terms of simple characters. 
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A. Methods for computer storage. 

For the purpose of computer storage of a tree structure, it 
is useful to consider the label or name of each node and its associated 
location-relation parameters as a vector. This vector is then con- 
sidered as the jth row of a chain list matrix. See Figure ll. The 
nodes, and thus the rows of the matrix, may be listed in three 
different orders: 1) arbitrary (random) order, for chained heirs 
and chained sib set type trees, 2) level order, for chained heirs and 
block sib set type trees, and 3) subtree order for block heirs and 
chained sib set type trees. See Figure 1l. The matrix representa- 
tions of tree structures in Figure 11 include additional relationships 
which serve to simplify programming. Indeed, when the nodes are 
listed in either level or subtree order, the second and the third 
columns (names and degrees) of the matrices L and S may be shown to 
pemsutticient. [32] 

The big advantage, of course, to the arbitrary or random ordered 
matrix is that 1t is easy to add or delete nodes and effect other 
structural changes as required in many dynamic systems. No re- 
shuffling is needed at any time. If the nodes are listed in level 
or subtree order, much of the structure is inherent in the order of 
listing and less memory is wasted on structure information. However, 
a severe price is sometimes paid when information too lengthy for 
the available extra space is inserted thereby requiring relocation 


of major portions of the matrix. 
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The factor of computer word-length enters into the consideration 
at this point. As we have previously pointed out, computers that 
have variable word lengths free the programmer from any concern over 
the row vector fit. When dealing with fixed word length machines, 
inefficiencies will be inevitable when the row vector length is not 
an even multiple of the computer word length, 

Memory access systems for processing variable length operands 
in fixed word length computers have been proposed as a solution to 
this problem. Henderson and Flynn have suggested a scheme which 
addresses the characters of the memory words vice the words them- 
selves, [7] A control triple contains the character address, the 
number of characters or length of the operand, and a single bit which 
Signals right or left justification. Masking and shifting operations 
are performed in the hardware and entail no time penalty. 

As an example of the necessary word length computation, consider 
again the row vectors of the matrices in Figure 11. The indices in 
A> L,;> and a are the GHSaaes Oe the computer words containing 
each row vector. If we allow the names of nodes to be any alpha- 
numeric, then b = 6 bits and m = 2 =¢4 2 d. We pointed out 
previously (p.60 ) that in some applications of trees the names 
of the nodes are more complex than a single alphanumeric character. 
In these cases the character may be replaced by the address of a 


list which may be any length 2 1. See Case 2 below: 
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Case 1: Alphanumeric name 


A 6 L 6 S 6 

AS is “s 6 s¢ 6 

Ay, 15 Li, 15 Si, hs 

A. 15 PAF | 08 We tage" 27 bits 
51 bits 

Case 2: Complex name 

A WS L 15 S iS 

AS 15 Ls 6 = 6 

A, LS Ly 5 Sh, 15 

A. 15 36° DIes 36 bits 
60 bits 


Two distinct cases occur when dealing with multi-tree structures. 
The first is the case in which all the nodes are considered distinct. 
The second is the case in which many nodes exist in more than one 
rooted subtree. Note that the nodes within each rocted subtree are 
considered distinct, Due to the greater complexity of the minimum 
matrix representation for the second case, its use should be con- 
sidered only when the number of nodes which are unique to one given 
tree is less than one half the total number of nodes. [19] 

A heirarchical tree is an example for which it is expected that 
each node will appear only once, while syntactical trees representing 
sentence structures is an example for which a large number cf nodes 
(words) will be repeated. 

The problem of non-distinct or overlapping nodes is basic in 
the application of list and tree structures, One of the important 
merits of list processors is that data having multiple occurrences 
often need not be stored more than one place in the computer, One 


may visualize this situation as the overlapping or intersection of 
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lists, and its utilization may be regarded as a step toward global 
optimization of storage allocation. 

However, overlapping lists pose a problem when 1t becomes 
necessary to erase a particular list or node and return the space 
to the list of available space (LAS). When a given list is no longer 
needed, it is desired to erase just those parts that do not overlap 
other lists still in use, There is no general method of doing this 
short of making a survey of all lists in memory. [3] 

Several solutions to this problem have been found, each with 
its own disadvantages or limitations. A solution incorporated by 
McCarthy in his LISP merely abandons lists as they are no longer 
needed, [22] Then, when the LAS has been depleted, a survey is 
made of all registers currently employed in non-abandoned lists and 
the complement of this set is returned to the LAS, 

There are two sources of inefficiency in this method, Depend- 
ing on the computer, the application, the programmer's skill, etc., 
the time needed to reclaim unused storage space 1s nearly independent 
of the amount of space reclaimed, The efficiency, therefore, drops 
off rapidly as the memory approaches capacity. Second, the method 
as used by McCarthy required that a bit (in this case, the sign bit) 
be reserved in each word for tagging accessible registers during the 
survey of accessibility. If data consists of signed integers or 
floating-point numbers, this results in awkwardness and further 
loss of efficiency. This method can be modified by setting aside a 
block of memory and establishing a one-to-one correspondence between 


the bits in this block and the words in the remaining memory. This 
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modification eliminates the awkwardness of arithmetic, but takes 
a further toll in the speed of reclamation. [3] 

A second method, used by Gelernter et al in FLPL, consists of 
adopting the convention that each list component is "owned" by 
exactly one list, and "borrowed" by all other lists in which it 
appears, [10] A "priority bit" is then used in each "location 
word" to indicate whether the list entry referenced is owned or 
borrowed, An erasing routine is used which erases only those parts 
of a list that are owned. This system obviously works only so long 
as no list is erased that owns a part that has been borrowed by 
some other list not yet erased. 

A third method has been developed by Collins as a means to 
overcoming the disadvantages of the first two. [3] The method is 
based on the interspersion of words containing reference counts, 
Viewed in terms of the conventional diagrams of lists, such a 
reference count is a tally of the number of arrows point to the 
box containing the Pereree count and emanating from boxes in the 
same or other lists. See pp. 5+, 6&:.. <A two-bit type code then 


L ; 
appears in each location word one value of which serves to identify 


3 


Srech list is represented by a set of location words and data 
words, Data words contain atoms and consist of a single field, 
Location words are divided into fields which contain a type code, 
the address of the next location word, and a field which may contain 
any one of four things, determined as follows by the type code: 


Location of an atom 

Atom (if max, no, of bits small enough) 
Location of a list 

Reference count 


WnNre © 
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a reference count as such. The obvious convention is adopted that, 
unless a word contains a reference count, there is exactly one 

arrow pointing to it. A reference count of one is thereby redundant 
but otherwise harmless. 

A disadvantage of this system is that atoms cannot be borrowed, 
Often this will not be a significant encumbrance. When it is, it 
can be partially solved by uniformly replacing each atom, a, by a 
list whose only term is a, 

Finally, a method slightly different from the solutions already 
given is the use of an auxiliary aid to keep tabs on distinct nodes 
and successor pairs. This aid is called a connection matrix. A clear 
description of this matrix is too lengthy to be given here. Suffice 
to say that it exists apart from the chain list matrix which describes 
the tree but the chain list matrix in this case contains only distinct 
nodes, See [32 | for a brief description and [19] for a more 
rigorous treatment, 

B. Search operations for trees, 

We previously discussed search operations when data was stored 
according to any one of the four basic schemes, i.e. random, ordered, 
computable and chained allocations, The same searches may be applied 
to the tree allocations with minor modifications since the tree 
allocations employ various combinations of the basic aliccations, 


We shall again consider: 
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Match search 

Multiple-match search 

Smallest search 

Next smaller search 

Add an item 

Delete an item 
It will be assumed that the tree under consideration 1s uniform for 
the derivation of efficiencies but this restriction does not apply 
to the search procedures themselves, As in Section 4, the treatment 
here is adapted from Sussenguth with minor modifications. 136] 

foeconduct a match search inea Cree allocation, One @irrst 
searches the sib set of roots to locate the node whose associated 
character is the same as the first character of the argument. Then 
one proceeds to the first sib set. of this node and searches again 
to locate the node with character matching the second character of 
the argument. This process continues through level h, The manner 
in which the sib sets are searched and the manner in which the next 
sib set is located depend upon the particular allocation, 
Consider first the allocation in which heirs are chained and 

the sib sets are computable allocations within blocks. For a 
match search, h words must be retrieved from memery, one for each 
level. The address of the root word is computed for the first 
element of the argument and an initial address parameter, The root 
word will contain the parameter required to locate its sib set, and 
this parameter and the second element of the argument are used to 


compute the address of the word corresponding to the proper node 
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on the second tree level. This process is repeated for all 4h levels 
(program Tl), thereby entailing h memory accesses and h function 
evaluations, 

As a simple example of such an allocation consider the character 
sets to be the integers from 0 to m-l, and let the sib sets be stored 
in order, At each node is stored the address « of its heir node. 
Thus, if the node of the kth level has just been retrieved, the 
address of the node of the (k+l)th level is just & + Xa] where 
x is the argument of the search, 

An allocation similar to this is that in which the heirs are 
chained but the sib sets are random allocations within blocks. The 
chaining from node to heir is accomplished by extracting the chain 
address which is stored at each node. The proper node of the sib 
set is obtained by comparing the stored character with the appro- 
priate character of the argument; if they match, the proper node 
of the sib set has been found; if they do not match, the next memory 
word, which is the next node in the sib set, is tested. When the 
proper node has been found, its heir is located. This is repeated 
for all h levels (program T2). Thus, the search time is composed 
of the h item times needed to pass from root te leaf via the heir 
chain plus the item times needed in the sib set searches, These 
are at most d-l per set, at least none per set, and (d-1)/2 per 
set on the average. 

The search procedure for the allocation in which both the heirs 
and the sib sets are chained is exactly the same as the search just 


described except that, when searching the sib sets, the sib chaining 
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Program Tl 
Match search for x 
Heir chains, computable sib sets 







i <— location of 
first root 
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Progran T2 Program T3 
Match search for x Match search for x 
Heir chains, random sib sets Heir chains, sib chains 


70 





address is used to locate the next node rather than consecutive 
memory locations. (Notice that program T3 is program T2 but for 
one line). 

Because these three allocations are so similar, the details of 
the remaining searches will be discussed only for doubly-chained 
Erees, 

The items satisfying the multiple-match search when u is a 
prefix vector are those items corresponding to the subtree subtended 
by the node with label u/x. Program T4 delineates such a search: 
First, the node u/x is located by a match search, The leaves of 
its subtree are found by finding the heir node of u/x, then the heir 
of the heir, etc., until the heir which is a leaf is found; its sibs 
are retrieved. Then, it is necessary to "back-up" one level to the 
parent node of the sib set just retrieved, find the next node of 
the sib set of that level, move "forward" to its heir, and retrieve 
its associated sib set. This process is repeated until the entire 
subtree has been retrieved, Thus, if the u/x node is in level h-k 
(i.e. u has h-k 1's) the as subtended leaves are retrieved by 
locating fos sib sets; the parents of these sib sets, in turn, 
form as-2 sib sets themselves, and so on back to the u/x node, 


k 
Thus to retrieve the s = d_ subtended leaves, 


Diet psi 


items must be handled. (The factor of 2 occurs because each non- 


leaf node is retrieved twice, one in the forward direction and once 
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Match search 


jwm—m—je+il i<—— location of 
first root 





Program T4 Program T5 
Kultiple-ratch search ! Smallest search i 
for x wrt prefix u Heir chains, sib chains 
Heir chains, sib chains | | 
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in the backup.) The match search time to retrieve the u/x node must 
be added to this to determine the multiple-match search time, 

To accomplish this search, it is necessary to locate the parent. 
of a given node; the chaining structure introduced thus far does not 
permit this, However, a simple modification can provide this facility: 
The last node in each sib set has nothing chained to it in the sib 
chain; therefore, its address portion is vacant and may be used to 
indicate the location of the parent of the sib set. To avoid an 
Peieuity’ of having a "backward-heir" chain address in the portion 
of the word assigned to sib chains, it is necessary to add an extra 
marker bit to the sib chain addresses which indicated to which chain 
the address belongs. These "backward-heir" or parent chains are 
indicated by two broken lines in Fig. & and the marker bit is denoted 
by the symbol A in the programs. 

The tree allocations with no sib chains cannot have the back-up 
property as described. Frequently, however, this may be circumvented 
by additional bookkeeping in the search programs, which may entail 
a considerable expenditure of time with respect to the retrieval of 
a single address, If there is sufficient word length available, an 
additional link may be included to indicate the parent node as 


illustrated in Figure 11, Column Aa. 


There is no ambiguity if the number of nodes in the sib set is 
known. This will not, however, be the case in general, 


fe 





The smallest search in the doubly-chained tree allocation 
proceeds exactly as the match search, except that the smallest charac- 
ter in each sib set is selected rather than a matching character, 
(program T5). 

To search for the item next smaller than a given item is more 


complex, however, One first locates the given items then its sibs 


« 
are searched to find a smaller item. (Specifically, since the words 
contain only a single character, the sib set is searched for the 
character next less than the least significant character of the argu- 
ment). If such an item exists, the search is completed. If not, that 
is if the given item is the smallest in the set, the parent node must 
be retrieved and its sibs searched for a smaller character. [if one 
is found, its first sib set is searched for its largest item and 
the search ends, However, it may still be necessary to back-up 
another igen Thus, if k back-ups are required, the search is 
composed of one match search,2k links in heir chains (k in each 
direction), and next smaller or largest searches through 2k + 1 sib- 
sets (the original sib set of leaves, k searched for the next smaller 
item, and k for the largest), 

Since about d item times are required to search a sib set, 
the search times may be calculated once the appropriate walues for 
k are known. Clearly, the minimum k is 0, and the maximum h ~ 1 


(the argument is the smallest item of the file). To ascertain the 


‘in the chained allocation at most one back-up was needed because 
the entire item could be examined and the most significant character 
changed, Here several back-ups may be needed because only one character 
may be examined at a time, 
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average k, consider the character sets to be the integers between 


h-k h-k- 
QO and d-l. In the file there are d -d ie items with exactly 


k trailing O's, and for each such item k back-ups are necessary. Aare 





£20 

= ng” =k =h} 
oe \a-1 

2 1 
ae 


In heir-chained tree allocations the procedures for adding 
or deleting items are analogous to the procedures for computable, 
random, or chained files according as the sib sets are computable, 
random, or chained, 

The tree allocation schemes as described thus far allow efficient 
searches only when the mask vector u is either the full vector or 
a prefix vector, For example, to perform a match when u ¢< (0,1,...), 
it is necessary to search all of the nodes of the second level to 
locate characters matching the second character of the argument. 
To test all the second level nodes, it is necessary to proceed from 
a root to its sib set, back to the root, down to the next root, to 


its sib set, etc, This could be a time-consuming process, especially 


T5e6 appendix. 
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if u had several leading O's, 

If, however, the allocation includes character chains (Figure 10), 
arbitrary u vectors are permissible, Consider the multiple-match 
search x wrt u for a tree which has heir, sib, and character chains, 


Assume u has 1's in positions u., and O's elsewhere. 


; la =32 geeos U 


jk 
The character chain of ae is traced to its first node. Then the 
Ci, - jy.) th parent of that node is found by proceeding via the 


parent chain for j, levels, If that parent node does not have 


~ then 
character value mee? this portion of the search may be terminated, 
and the next node of the character chain tested, If the parent node 


does have value ell 


however, find its parent in the J,pigth level 
and test it similarly. If the original candidate node is such that 


all its parents in levels j)> Jnoeee, have character values 


J 
matching x, then the leaves of the subtree subtended by that node 
constitute part of the set of items satisfying the search (program T6), 
C. Evaluation of results. 

Tree allocation systems appear both efficient and versatile, 
Most files can be allocated in a tree structure without difficulty. 


One set, say S,, is chosen to correspond to the roots, and ali items 


3 
having the same first character are placed in the subtree subtended 
by the root corresponding to that character, Sib sets in the second 
tree level are formed from the characters of another vet, say Sos 
and then their sib sets, through h levels. if the file items are 
correlated (i.e, have common characters), and it is reasonable to 


expect that such correlation would exist in practical applications, 


the number of dummy nodes which must be added is relatively 
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Pall. Indeed, for many trees fewer bits are required to store 
the file and its associated structure than are required to store the 
file on an item-by-item basis, This apparent paradox vanishes when 
it is observed that one character at the root of the tree is shared 
by all the items of the subtree subtended by that root, Moreover, 
in a tree allocation it is possible to perform ail of the operations 
considered with relative ease. More complex operations (e.g. search 
for the item next smaller than a given item with respect to an arbi- 
trary permutation) can be accomplished with more elaborate programs, 
When comparing the relative efficiencies of the tree allocations 
the same considerations regarding like quantities as were discussed 
before apply (p.43) , i. e. time and required storage capacity with 
due regard for their inverse relationship. Table 4 is a summary of 
search times and storage ratios. As in Table 2, there are three 
entries for each search: the maximum, minimum, and expected number 
of item times to complete the search, Frequently the entries are 
approximations to the expressions which have been derived, When 
applicable, the u and p for which the entries are calculated are 
shown, Table 5 repeats the average values of Table 4 but with typical 


values chosen for the file parameters, 


a: is easy to show this number, which is nf(l-r(words)), is 
minimized if the set of smallest order is chosen to correspond to 
the roots, the set of next smallest order to the second level, etc. 
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(h + l)d hd + 1 
Delete 
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Item — 


$(hd + h + d) 3 (hd + bh + 2) 


Storage Ratio 
(words) 


Storage Ratio 
(bits) 


Bits per Word 





Mask vector u restriction 
1. mask must be prefix of tree (level) order 
Permutation vector p restriction 


2. identity permutation only 
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7, Minimization of expected search time, and the search-time, 
storage-space product, | 
We have seen the general reduction in item times with only a 


slight increase in storage ratios when considering tree allocations 


vs. the four basic allocations, Note Tables 2 and 3 vs. Tables 4 


and 5. The item times given in Table 5 are dependent on the selection 


of the typical parameters shown just beneath the table. It is the 
purpose of the following discussion to show some calcuiations based 
on these parameters which will optimize the desired characteristics 
of the tree structure, 

The parameters shown in Table 5 are defined in Table 1 and are 


reiterated here for convenience: 


a= 15 number of bits to specify an address 

b= 6 number of bits coueeacuune a character 

c= 3 number of character positions which are chained 
f= 2 number of item times to compute function f 

h= 6 number of character sets (height of the tree) 
m= 30 number of members of each character set 

n = 20,000 number of items in the file 


s = 100 number of items satisfying a multiple-match search 
Parameters a and b are characteristics due to binary coding 
and the machine in use. c and f are included to allow calculations 
for the "Tree with computable sibs" column, We disregard these now 
since the difficulty with add item and delete item functions make 
this scheme unfeasible except in very limited applications, nm and 


s are characteristics of the file used for comparison and are 


therefore beyond our control, This leaves h and m, the only two 
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parameters available for manipulation. 

If there are m nodes in a filial set, the expected number 
which must be tested before a match between a nodé value and the 
appropriate element of the query is 4m. Thus the expected number 
of chaining links required to find a match within one filial set 
is % (m1); one link to reach the filial set and % (m - 1) to 
search it. 

Assuming the search time is proportional to the number of 
chaining links traversed, the expected search time may be calculated 
if all the filial set sizes are known, However, it is inconvenient 
to use the set of all filial set sizes in macroscopic calculation, 
For computational purposes an average filial set size for each 
tree level is defined as; 


number of nodes on level i 
“ = (1) 


number of filial sets on level i 
Since the average time to search a filial set on the ith level is 
4(m, + 1), the expected time to search a file of n items allocated 


as anh level tree is 


4 
t=' > (m+). (2) 
ye 


Equation (2) gives the expected search time in terms of the para- 


meters m, and h, These parameters are related by the expression 


i 
m, =n, (3) 
i i 


a=] 
because the product of the filial set sizes is the number of leaves, 


| 37] 
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The binary coding of alphanumeric characters allows additional 
flexibility in the variation of m, provided the key elements may 
be manipulated freely. The key may then be considered to consist 
of a single string of binary digits rather than several disjoint 
elements, each consisting of several binary digits. For example, 
the key "CAT" is considered as the binary string '"010011010001110011" 
rather than the set of distinct elements "010011", "010001", and 
"110011", With keys of this format, the binary digits may be 
grouped to give the most efficient search system, i.e. m, and h 
may be selected without constraint (other than (3)) to minimize t. 
An example of this division technique is shown for several alpha- 
betic characters in Figure 12, [37 | 

Noting the form of equations (2) and (3), it is clear that the 
expected search time is independent of the order in which the levels 
are taken and that the minimum search time is achieved when the 
m, are all equal, If the m, are all equal, (3) reduces to 
a =n and (2) becomes 


t = § h(m+ 1) = 4(m+ 1) log | n (4) 


Taking the first derivative of this expression with respect 
to m and setting it equal to zero, we find a minimum t at m = 3,6. 
Equation (4) is normalized and plotted in Figure 13, By simple 
arithmetic we find 
t = 2,3 log, ¢ n= 1,24 log. n 
That is, the expected search time in the optimum case is only 24 


percent slower than the binary search time, Moreover, since the 
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time required to add or delete an item to or from the file is 
approximately the same as the search time, a considerable improve- 
ment over the file allocated for binary searching (ordered alloca- 
tion). 

Therefore in summarizing the case when it is possible to freely 
manipulate keys, the key elements should be selected so that 

1) all paths from a root to a leaf have the same length, h; 

2) all filial sets have the same number of members, m; 
and 3) the common filial set size m is near 3.6 and h=log_ ae liza 

The above rules for the adjustment of parameters are based on 
the consideration of minimum search time only. A more realistic 
value may be obtained by also considering the cost due to the higher 
storage ratio caused by small filial set size, The total number of 
(dummy) nodes in the tree which serve only to index the leaves 
(items) is 

ve Tm, 
sp Lt (5) 


Tf it is assumed that all filial sets m, are equal (5) becomes 


rad ie en (6) 





Equation (6) shows that as m increases, the number of required 


Storage locations decreases. 


87 





We have previously defined a meaningful measure of efficiency 
for storage allocations as the product of search time and storage 


capacity. Using (4) and (6) 


h 
C=txwWe [= nm + 1)] x [=(2 =)! 
nit 7; 


hm gmt 1 ) (m - 1) (7) 


Equation (7) is plotted in Figure 14 in normalized form, The minimum 
cost is achieved when m = 5,3. 

It is useful to note that in both Figure 13 and Figure 14, the 
curves are shallow at the minima, The values of m may therefore 
vary from approximately three to eight with less than 20 percent 
degradation, 

When it becomes necessary to use an m greater than these 
optimum values, the efficiency of the system will suffer. Numeric 
key elements would be acceptable since m= 10, But when the keys 
are alphabetic (m = 26) or alphanumeric (m > 36), it becomes 
necessary to add refinements, 

In the discussion of efficiencies above, a tree with chained 
heirs and +r angi chained sib sets was implied. We have seen that 
this sort of doubly-chained tree requires two address fields 
(one for the heir link and one for the sib set link) plus a field 
for the label. A slightly more complex allocation in which the 
sib sets are chained in an ordered arrangement (this will permit a 
binary type search within each sib set) will be efficient for 
larger m. To accomplish this ordering for a binary search within 


a filial set, it is necessary to expand the memory map (matrix row 
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vectors) to three addresses per node, corresponding to the states: 
greater than, equal to, and less than, In the two unequal condi- 
tions the address is that of the next sister node to be examined; 
the equal condition address leads to the filial set entry at the 
next level, or if the termination symbol has been recognized, to 
the location of the desired item, non 

On a variable word length computer this revision would demand 
an approximate 50 percent increase in storage requirements, Less 
flexible machines would probably require a 100 percent increase, 
Even in this latter case, the proposed method is superior as we 
shall now show. 

Consider a filial set of size m. [If the ith member of the 
set is selected as the starting point in the search, the original 
set is partitioned into three subsets with 1, m-n, andn - 1 
members, The average search times for these sets are 1, 1+T 


m-n 


and 1 + TO respectively, The average search time is therefore 


-1 


T 
m 


a fi+Q+t_ ) @n)+ +7) @D] 


m 
1 + x9 > } (Tag ) (m-n) + (Tp (n-1) | 
Ne! 


Because of symmetry, this is equivalent to 


mn 
ES = 1+ 25 (T41 ) “Ger ) 
Tm 
N=! 
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If we define i = n - 1, we have 


- yn =| 
DR i A I; 
which is equivalent to aoe 
ynel 
3) : 
Tre + DAE 
22 


Obviously, T, ig Be 

Now that the expected search time is known, relative cost can 
be computed on the same basis as used previously (equation (7)) 
for direct comparison. See Table 6 for a comparison of average 
search time, relative search time, and relative cost vs, the size 
of the filial sets, With a 50 percent increase in storage require- 
ments assumed in computing relative cost, the proposed method is 
superior when the average filial set size exceeds 9, If 100 percent 
storage increase is required, the break-even point is 16. [2 ] 

Another variation is to allow unbalanced trees, This type of 
tree may be utilized when it is not possible to freely manipulate 
keys and select filial sets to be near optimum size, The path 
lengths may then vary within the tree, and the tree may be con- 
structed using random length key words. In the case where the main 
storage is a disc, the proper path length is easily determined 
because each leaf must govern T items (where T is the number of 
items that may be accommodated on one track of the disc file), Thus, 
if a particular node governs T or fewer items, the branching along 
that particular path may be stopped, If, however, the node governs 
more than T items, the branching must be continued for at least 


one more level, 
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Frequently there are subtrees which are also chains, i. e. the 
filial set of each node in the subtree consists of a single node, 
For example, when using English language words as keys the commonly 
occurring suffixes such as ing, tion, ain, ed and ly seldom span 
more than one item. Such chains waste both storage space and search 
time. Indeed, if a node governs only a few items, instead of con- 
tinuing the tree in the normal manner from that node-- entailing the 
use of several levels to reach the items-- the items should be assigned 
to the nodes of the filial set of that node and no additional levels 
used, By doing this, fewer nodes are traversed to reach the leaves, 
thereby reducing both the required storage space and the search 
time, A variable length value field is required for these leaf 
nodes, however, because all the key elements of the item not already 
assigned in the partial key of the parent node must be assigned the 
values of the leaves, 

Thus, in this case it is of interest to determine at which nodes 
the branching should be terminated to achieve a more efficient 
search time and a reduced storage requirement. A path leading to 
a block of items is considered to have its optimum length if the 
expected search time for items in the block is increased when the 
path length is changed, The computation of this optimum path length 
is based on the number of items, g, which a node x governs and the 


number of nodes, m, in its filial set. The nodes of the filial 


> 


set of x each govern an average of g/m items. If the branching is 


discontinued at node x, the expected search time x is 
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ty =o (ctw 


If, however, the branching is continued for one more level, the 


expected search time from x is 


Thus, the expected search time from x will decrease if branching 
is continued when ty >t. ; | 37 ] 


Figure 15 displays the relation between t, and CW ¢ for any 


d 
node with its (g,m) lying in the pie-shaped region to the right 

of the solid curve, the search time can be decreased by continuing 

the branching from that node, The dashed curves indicate the rela- 
tive improvement in the search time between the cases of continuing 
and discontinuing the branching from the node. As g 2m, the 
criterion as to whether to branch or not given in Figure 15 may 

be approximated by stating that the branching should be continued 

from any node which governs more than six items, [ 37 ] 

Given a situation where the size of h and m are either ( or both ) 
required to be larger than optimum, we are then faced with a tree 
which is to some degree vacant, An analysis of the manipulations 
and searches associated with such a tree forces one to rely primarily 
on the mathematics of probability. Scidmore and Weinberg have 
developed an analysis of trees from this viewpoint. [33] The 
analysis includes the situations 1) where keywords are chosen 


randomly from those with lengths of one through nine elements, and 
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2) where all keywords are the same length. Values of m= 10, 26, 
and 50 were used to simulate decimal numbers, alphabetic characters, 
and alphanumeric characters respectively. 

In the case where the keywords were considered all the same 
length, the results of the analysis for the predicted mean search S 
correspond exactly with results previously derived for t (recall 
that in our previous development we considered t as a direct function 
of the number of item times, i. e. the number of links traversed, 

t is therefore identical to S). A plot of the mean search vs. the 
logarithm of the fraction of the tree in use is shown in Figure 16 
for several values of m and L (L here is equivalent toh). It is 
apparent that when there are very few sequences stored, the mean 
search is approximately L and somewhat independent of m, However, 
as the number of stored sequences approaches a (same as a ye 


the mean search approaches % L(m+ 1) as noted before, 
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oe Comparison of modified trees, 

A useful modification to the more straightforward tree struc- 
tures already described is the trie,’ Fredkin has proposed a 
scheme in which the heirs are chained and the sib sets are arranged 
in blocks, [3] The basic feature is that whenever a node on 
level i requires an heir on level i+ 1, a block large enough to 
contain an entire filial set is established. The heir address in 
level i is the address of the first logical member of the fiiial 
set and all other members of that set are understood to exist in 
sequence in the succeeding memory locations, Therefore, each 
computer word corresponding to a node need only contain the heir 
address since the node's label is implicit in its position within 
the sib set, The advantage in machines and applications where word 
size is a problem is obvious, Also, retrieval from a trie structure 
requires one indexing manipulation for each level of the tree, not 
an item-by-item search of the nodes as in the tree structure. 

Thus, the expected search time is proportional to the average path 
length, h, rather than %h (m+ 1). This speed advantage is com- 
pensated by requiring more storage locations, however. 

The trie structure is introduced at this point since it 
represents a middle-ground compromise between the two extremes in 
tree structures, i.e. the balanced tree and the completely elided 


unbalanced tree. It therefore provides a fwicrum for comparison 


it 
The word trie is taken from the word retrieval. 
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of the various types, See Figure 17 for a pictorial comparison 
between these three basic types under a storage condition where the 
majority of the balanced tree is vacant. 

Let d be the average number of elements utilized in each filial 
set where d is a subset of the m possible elements in a filial set, 
Note that in a balanced tree, d = m, Computing the number of 
storage locations W, we find: 


For a balanced tree 


For a trie 


W=m+dm _+ jae +... td m= m 


For an elided unbalanced tree 


Since d < n, 
ae - 1 ae =a Ae < oy: - J 
oe oS eae a De a 
: ee | d - 1] ne | 
or 


Welided < eerie < O yaaleneetd 
From this simple comparison, it is obvious that the ratio of 
d/m is a convenient index to the preferred tree structure, This 
index, however, is sensitive only to the storage criteria and 


neglects the equally important factor of search time, One of the 
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advantages of the trie is a faster search time so that it is even 
more attractive than the elided tree in certain instances, 

The balanced tree finds its place in applications such as the 
multi-list structure developed at the University of Pennsvivania. 

[18] The tree acts as a translator whose inputs are coded 
information words and whose outputs are addresses of lists on which 
items of information are stored. The keys, assumed to enter the 
system in random order, are inserted in the tree so that the mono- 
tonic order of key values is preserved and the balanced growth of 
the tree is continuously maintained, The leaves of the tree lead 
into the Multi-Association Area as lists, Each list is a string of 
items that have at least on key in common, 

A technique which is roughly equivalent to an h-level baianced 
tree is the addressing of multidimensional arrays as linear arrays 
or vectors, This technique provides a means for partitioning a 
file into subfiles which is simpler than the tree structure and 


also permits more rapid entry into a subfile than the tree, f14] [12] 
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a Fibonaccian searching, 

The standard binary search of an ordered list for a given 
argument consists of comparing the argument with the middle item of 
the list in order to isolate the item desired to a list half as long 
as the original list, This process is repeated until the item is 
found, As we have seen, the number of trials required is of order 
log, n for a table of n entries. If it is desired to save memory 
space, the addresses for the successive trials are computed, but 
this requires time consuming division or multiplication by an inverse 
in machines having no binary shift. A table of the powers of two 
is stored and used if a program which is conservative of time is 
desired, 

One method to conserve both time and storage space utilizes the 


Fibonacci numbers defined by: 


Kes 


The important distinction here is that the successive increments are 
found by subtraction, [6] 
If at some point in the process the item has been isolated 


to an interval of size u, beginning at A, compare x to GC, | 
i i-l where 
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x = value of the argument 


Cc. = contents of At+u, 


i-l i-1l 
Lt x<C. > the item is now isolated to an interval of 
size Us 4 beginning a A, hence replace i by 


i-l and repeat the process. 


Pics >C, ? the item is now isolated to an interval of 


size u, - 


; u, =u beginning at 
if i-l i-2 & 5 


A+ Us. 


and repeat the process. 


It is easily verified by induction that 


, oe 


where gd b+ Vs! 





HOt Laree 1, 


k-~ — ene = an Ad 
n Je? ? d,. > D=-f/ CB 


and 





—> P= 2.618 10 
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so replace i by i - 2, A by At Wey 





In the worst case, the reduction factor R, (defined as the ratio 


of successive intervals) is pd , requiring 


4a, m = coos Pe Zrvials. 
> Lag P 


: A ae 
In the first case (if x is not found), R, iS p » requiring 


La Ate = me Zrials 
¢ A bry P 


+ qR 





Clearly, the expected value of Ris pR where 


] Que? 


p is the probability of R, and q is the probability of R 


iI 2 e 


| | 
= and so =—_— 
tf" % bt 
giving 
-(Lip+ 2(H42 2 
R= (3 B (F) 
as in the binary search, 
Therefore, for searching an ordered list having n elements, the 
expected searching time is of order log, n and the maximum searching 


time is of order ree he [6] 
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CLOSED FORMS OF SUMMATIONS .- ”— 
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All of these expressions may be verified by induction. For . 
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